Abstract

Smartphones have become an indispensable part of our daily life. Their
improved sensing and computing capabilities bring new opportunities for
human behavior monitoring and analysis. Most work so far has focused on
detecting correlation rather than causation among features extracted from
smartphone data. However, pure correlation analysis does not offer sufficient
understanding of human behavior. Moreover, causation analysis could allow
scientists to identify factors that have a causal effect on health and well-being
issues, such as obesity, stress and depression, and to suggest actions to
deal with them. Finally, detecting causal relationships in this kind of
observational data is challenging since, in general, subjects cannot be randomly
exposed to an event.

In this article, we discuss the design, implementation and evaluation of 
a generic quasi-experimental framework for conducting causation studies on 
human behavior from smartphone data. We demonstrate the effectiveness 
of our approach by investigating the causal impact of several factors such as 
exercise, social interactions and work on stress level. Our results indicate that
exercising and spending time outside home and the working environment have a
positive effect on participants' stress level, while reduced working hours only
slightly impact stress.

Keywords: smartphone data, causality, human behavior, stress modeling 


1. Introduction 

Nowadays, people generate vast amounts of data through the devices they
interact with during their daily activities, leaving a rich variety of digital
traces. Indeed, our mobile phones have been transformed into powerful devices
with increased computational and sensing power, capable of capturing
any communication activity, including both mediated and face-to-face
interactions. User location can be easily monitored and activities (e.g., running,
walking, standing, traveling on public transit, etc.) can be inferred from raw
accelerometer data captured by our smartphones. Even more complex
information such as our emotional state or our stress level can be inferred
either by processing voice signals captured by means of smartphones'
microphones or by combining information, extracted from several sensors,
which correlates with our mood. Moreover, we keep track of our
daily schedule by using digital calendars and we use social media to share
our experiences, opinions and emotions with our friends.

Leveraging this rich variety of human-generated information could provide
new insights on a variety of open research questions and issues in several
scientific domains such as sociology, psychology, behavioral finance and
medicine. For example, several works have demonstrated that online social
media could act as crowd sensing platforms; the aggregated opinions posted
in online social media have been used to predict movie revenues [9], election
results or even stock market prices. Social influence effects in social
networks have also been investigated in several projects, either using
observational data [12] or by conducting randomized trials. Other
works also use mobility traces in order to study social patterns [12] or to
model the spreading of contagious diseases. Moreover, smartphones
are increasingly used to monitor and better understand the causes of
health problems such as addictions, obesity, stress and depression [18].
Smartphones enable continuous and unobtrusive monitoring of human
behavior and, therefore, could allow scientists to conduct large-scale studies
using real-life data rather than lab-constrained experiments. In this
direction, in [20] the authors attempt to explain sleeping disorders reported by
individuals, by investigating the correlations between sociability, mood and
sleeping quality, based on data captured by mobile phone sensors and
surveys. Also, in [21] the authors study the links between unhealthy habits, such
as poor-quality eating and lack of exercise, and the eating and exercise habits
of the user's social network. However, both studies are based on correlation
analysis and, consequently, they are not sufficient for deriving valid
conclusions about the causal links between the examined variables. For example,
an observed correlation between the eating and exercising habits of a social
group does not necessarily imply that the eating and exercise habits of
individuals are influenced by their social group and, therefore, could be modified
by changing someone's social group. Instead, the observed correlation could
be due to the fact that people tend to have social relationships with people
with similar habits.

The efficient exploitation of human-generated data in order to uncover
causal links among factors of interest remains an open research issue. Some
works have proposed the use of randomized trials. According to this
technique, the causal effects of an event or treatment are examined by
exposing a randomly selected subset of participants (treatment group) to this
event and comparing the result with the corresponding outcome on a
control group (i.e., a subset of participants who have not been exposed to the
event). By randomly assigning participants to treatment and control groups
it is assured that, on average, there will be no systematic difference in the
baseline characteristics of the participants between the two groups.
Baseline characteristics are considered to be any characteristics of the subjects
that could be related to the study (e.g., in a clinical study the age and
the previous health status of the subjects could be considered as baseline
characteristics). While randomized trials represent a reliable way to detect
causal relationships, they require the direct intervention of scientists in
participants' lives, which is sometimes unethical or just not feasible. Moreover,
such experimental studies cannot exploit the vast amount of observational
data that are produced daily.

Detecting causal relationships in observational data is challenging since
subjects cannot be randomly exposed to an event. Thus, subjects that are
exposed to a treatment may systematically differ from subjects that are not.
In order to eliminate any bias due to differences in the baseline
characteristics of exposed and unexposed subjects, scientists need to gather and
process information about several factors that could influence the result of
the study. There are two main methodologies that can be applied to control
such factors: structural equation modeling [22, 23] and quasi-experimental
designs [21]. According to the former, the causal effect is estimated using
multivariate regression. In detail, the variable representing the causal effect
of an event or treatment is regressed using as predictors the variable
representing the treatment as well as all the baseline characteristics of the subjects
of the study that could influence the result. Structural equation modeling is
based on the assumption that the regression model has been correctly
specified. False assumptions about the linearity or non-linearity of the model or
failure to correctly specify the regression coefficients may result in
misleading conclusions. On the other hand, methods based on quasi-experimental
designs do not require the specification of a model. Instead, they attempt
to emulate randomized trials by exploiting inherent characteristics of the
observational data. This can be achieved by comparing groups of treated and
control subjects with similar baseline characteristics (matching design).

The purpose of this work is to propose a generic causal inference
framework for the analysis of human behavior using digital traces. More
specifically, we demonstrate the potential of automatically processing
human-generated observational digital data in order to conduct causal inference studies
based on quasi-experimental techniques. We support our claim by presenting
an analysis of the causal effects of daily activities, such as exercising,
socializing or working, on stress based on data gathered by smartphones from 48
students that were involved in the StudentsLife project [25] at Dartmouth
College for a period of 10 weeks. The main goal of the StudentsLife project is
the study of the mental health, academic performance and behavioral trends
of this group of students using mobile phone sensor data. To the best of our
knowledge, this is the first work presenting an observational causality study
using digital data gathered by smartphones.

Information about participants' daily social interactions as well as their
exercise and work/study schedule is not directly measured; instead we use
raw GPS and accelerometer traces in order to infer high-level information
which is considered as an implicit indicator of the variables of interest.

No active participation of the users (e.g., answering pop-up
questionnaires) is required. We automatically assign semantics to locations in order to
group them in four categories: home, work/university, socialization venues
and gym/sports center. By grouping locations into these four categories and
continuously monitoring the spatio-temporal traces of users we can derive
high-level information as follows:

• Work/University. By analyzing the daily time that users spend at
their workplace we can infer their working schedule. Prolonged sojourn
time at work/university could be an indicator of increased workload.

• Home. The time that participants spend at home could serve as a
rough indicator of their social interactions. Prolonged sojourn time
at home could imply limited social interactions or social interactions
with a restricted number of people. In general, spending time outside
home usually involves some social interaction. An estimation of the
total daily time that participants spend at any place apart from their
home and working environment could serve as a rough indicator of their
non-work-related social interactions.

• Socialization Venues. By monitoring users' visits to socialization
venues such as pubs, bars, restaurants, etc., we can infer the time that
they spend relaxing and socializing outside home during a day.

• Gym/Sports-center. Indoor workout can be captured by tracking 
participants’ visits to gyms or sports centers. Outdoor activity can be 
measured using accelerometer data. 
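
As a rough illustration of how the four location categories could be turned into daily indicators, the following minimal sketch accumulates sojourn times from already-labeled stays. The tuple layout, the label strings and the function names are assumptions made for this example; they are not taken from the StudentsLife dataset.

```python
from collections import defaultdict

# Hypothetical category labels; the text names home, work/university,
# socialization venues and gym/sports center.
HOME, WORK = "home", "work/university"


def daily_sojourn_seconds(stays):
    """Sum the time spent per location category.

    stays: iterable of (category, start_second, end_second) tuples for one
    user and one day, e.g. produced by a location-labeling step.
    """
    totals = defaultdict(int)
    for category, start, end in stays:
        if end < start:
            raise ValueError("a stay cannot end before it starts")
        totals[category] += end - start
    return dict(totals)


def time_outside_home_and_work(totals):
    """Total time at any place other than home and work/university."""
    return sum(sec for cat, sec in totals.items() if cat not in (HOME, WORK))


stays = [
    ("home", 0, 3600),
    ("work/university", 3600, 10800),
    ("socialization", 10800, 12600),
    ("gym/sports-center", 12600, 14400),
    ("home", 14400, 18000),
]
totals = daily_sojourn_seconds(stays)
outside = time_outside_home_and_work(totals)
```

The "outside" total corresponds to the rough indicator of non-work-related social interactions described in the Home item.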

2. Causal Inference Framework 

Our causality analysis is based on Rubin's counterfactual framework [26].
According to this framework, a causal problem is formulated as a
counterfactual statement which examines what would have been the outcome if an
object had been exposed to an event. Since it is impossible to observe for the
same object both the result of exposure and non-exposure to an event, causal
inference is based on comparing the outcomes on equivalent treatment and
control groups, i.e., treatment and control units with similar baseline
characteristics. In this section, we discuss a methodology for causal inference in
observational data.

The first step of the analysis is the description of the variables of the
study. A causality study involves the following variables:

1. cause or treatment variable $X$: an independent variable which
influences the values of another variable. The treatment variable is usually
binary, denoting whether an object of the study has been exposed to a
treatment or not. Treatment could also be a discrete variable when
different levels of treatment are considered.

2. effect or outcome variable $Y$: a dependent variable which can be
manipulated by changing the variable that represents the cause.

3. a set of $N$ variables $Z = \{Z_1, ..., Z_N\}$, which describes the baseline
characteristics of the objects of the study.

In the second step of the analysis we define the units of the study. Each
unit corresponds to a set of attributes, derived from the variables of the study,
which describe an object (e.g., a person or a thing) in a specific time period.
We can use multiple units describing a single object in different time intervals.
Thus, a unit $u_{o,t}$ that describes an object $o$ at time $t$ corresponds to a set of
values $\{X_{u_{o,t}}, Y_{u_{o,t}}, Z_{u_{o,t}}\}$. Given that, in a causation study, the
treatment should temporally precede the effect, the value $X_{u_{o,t}}$ should
correspond to the treatment that has been applied to object $o$ before time
$t$. In the remainder of the paper, the simplified notation $u$ will be used to
describe a unit $u_{o,t}$.

In order to claim that a value of a variable $Y$ has been caused by a value of
a variable $X$ there should be an association between the occurrence of these
two values and there should be no other plausible explanation of this
association [21]. The first part of this requirement can be examined by performing
a simple statistical analysis. However, excluding any other explanation of
the observed association is a hard problem since both the treatment and the
effect variable may be driven by a third variable. Variables that correlate
with both the outcome and the treatment are called confounding variables
or confounders. In Figure 1 we provide a graphical representation of the
dependencies between the treatment, outcome and confounding variables. The
identification of the confounders requires a correlation analysis between each
variable $Z_i \in Z$ and the variables $X$ and $Y$.



Figure 1: Graphical representation of the relationships among the treatment
$X$, the outcome $Y$ and the set of confounding variables $C$.

An unbiased causality study requires that the assignment of units to
treatments is independent of the outcome conditional on the confounding
variables. While in experimental studies this requirement is satisfied by
randomly assigning units to treatments, in observational studies we could
eliminate confounding bias by comparing units with similar values on their
confounding variables but different treatment value (matching design). Let
us consider a binary treatment $X$, a group of treated units $U$ and a group
of control units $V$ such that $X_u = 1\ \forall u \in U$ and $X_v = 0\ \forall v \in V$. Let us
also consider a set of confounding variables $C$. Ideally, each unit $u \in U$
will be matched with a unit $v \in V$ if $C^i_u = C^i_v,\ \forall C^i \in C$. However, perfect
matching is usually not feasible. Thus, treated units need to be matched
with the most similar control units. Several methods have been proposed
to create balanced treated and control pairs. After applying a matching
method scientists need to check whether the treated and control groups are
sufficiently balanced by estimating the standardized mean difference between
the groups or by applying graphical methods such as quantile-quantile plots,
cumulative distribution function plots, etc. [28]. If sufficient balance has
not been achieved, the applied matching method needs to be revised.
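
To make the idea of matching treated units with the most similar control units concrete, here is a minimal sketch of greedy one-to-one nearest-neighbor matching on standardized confounders. This is a deliberately simple stand-in chosen for illustration, not the matching method used in this study; the function names and data layout are assumptions.

```python
from math import sqrt


def standardize(rows):
    """Scale each confounder column to zero mean and unit standard deviation."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(k)]
    return [[(r[j] - means[j]) / stds[j] for j in range(k)] for r in rows]


def greedy_match(treated, control):
    """Pair each treated unit with its closest unused control unit.

    treated, control: lists of confounder vectors (lists of floats).
    Returns a list of (treated_index, control_index) pairs.
    """
    scaled = standardize(treated + control)
    t_scaled, c_scaled = scaled[:len(treated)], scaled[len(treated):]
    unused = set(range(len(control)))
    pairs = []
    for i, t in enumerate(t_scaled):
        if not unused:
            break  # more treated than control units: the rest stay unmatched
        best = min(unused,
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(t, c_scaled[j])))
        pairs.append((i, best))
        unused.remove(best)
    return pairs


treated = [[1.0, 10.0], [5.0, 50.0]]
control = [[5.1, 49.0], [0.9, 11.0], [9.0, 90.0]]
pairs = greedy_match(treated, control)
```

Greedy matching depends on the order in which treated units are processed, which is one reason a balance check after matching remains necessary.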

Finally, if any confounding bias has been sufficiently eliminated, the
treatment effect can be estimated by comparing the effect variable $Y$ of the
matched treated and control units. Let us define as $G$ the set of paired
treated and control units and $N_G$ the number of pairs. Then, the average
treatment effect (ATE) can be estimated as follows:

$$ATE = \frac{\sum_{(u,v) \in G} (Y_u - Y_v)}{N_G} \qquad (1)$$
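
The average treatment effect over matched pairs can be sketched in a few lines. The function below assumes pairs are given as index tuples into two outcome lists; this layout and the names are illustrative assumptions.

```python
def average_treatment_effect(pairs, outcome_treated, outcome_control):
    """Mean outcome difference (treated minus control) over matched pairs.

    pairs: list of (treated_index, control_index) tuples.
    """
    if not pairs:
        raise ValueError("at least one matched pair is required")
    total = sum(outcome_treated[i] - outcome_control[j] for i, j in pairs)
    return total / len(pairs)


ate = average_treatment_effect([(0, 1), (1, 0)], [2.0, 3.0], [4.0, 2.5])
```

A negative value means that treated units have, on average, a lower outcome value than their matched controls.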

In Figure 2 we provide a graphical representation of the causal inference
methodology.



Figure 2: Description of the causal inference process in observational data 
using a quasi-experimental matching design. 


3. Dataset Description 

The StudentsLife dataset contains a rich variety of information that was
captured either through smartphone sensors or through pop-up
questionnaires. In this study we use only GPS location traces, accelerometer data,
a calendar with the deadlines for the modules that students attend during
the term and students' responses to questionnaires about their stress level.
Students answer these questionnaires one or more times per day.

We use the location traces of the users to create location clusters. Location
traces are provided either through GPS or through WiFi or cellular
networks. For each location cluster, we assign one of the following labels: home,
work/university, gym/sports-center, socialization venue and other. Labels
are assigned automatically without the need for user intervention (a detailed
description of the clustering and location labeling process is presented in
Additional file 1).

We use information extracted from both accelerometer data and location
traces to infer whether participants had any exercise (either at the gym or
outdoors). The StudentsLife dataset does not contain raw accelerometer
data. Instead it provides an activity classification obtained by continuously sampling
and processing accelerometer data. The activities are classified as stationary,
walking, running and unknown.

We also use the calendar with students' deadlines, which is provided by
the StudentsLife dataset, as an additional indicator of students' workload.
We define as $\mathcal{D}^u_{deadline}$ the set of days on which the student $u$ has a deadline. We
define a variable $D^{u,d}$ that represents how many deadlines are close to the
day $d$ for a user $u$ as follows:



$$D^{u,d} = \sum_{j \in \mathcal{D}^u_{deadline}} \begin{cases} \dfrac{1}{j - d}, & \text{if } j - T_{days} \le d < j \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Thus, $D^{u,d}$ will be equal to zero if there are no deadlines within the next
$T_{days}$ days, where $T_{days}$ is a constant threshold; otherwise, it will be inversely
proportional to the number of days remaining until the deadline. In our
experiments we set the $T_{days}$ threshold equal to 3. We found that with this
value the correlation between the stress level of the participants and the
variable $D^{u,d}$ is maximized.
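
The deadline variable can be sketched directly from its textual description: each upcoming deadline within the threshold contributes the inverse of the number of days remaining. The exact window boundaries (strictly after the current day, up to and including the threshold) and the function name are assumptions of this example.

```python
def deadline_pressure(day, deadline_days, t_days=3):
    """Sum 1 / (days remaining) over deadlines within the next t_days days.

    day: index of the current day.
    deadline_days: iterable of day indices on which the student has a deadline.
    Deadlines on the current day, in the past, or beyond the window add nothing.
    """
    return sum(1.0 / (j - day) for j in deadline_days if 0 < j - day <= t_days)


near = deadline_pressure(10, [11, 13, 20])   # deadlines 1 and 3 days ahead
none = deadline_pressure(10, [20])           # no deadline within 3 days
```

With several close deadlines the value grows, so it reflects both how many deadlines are near and how near they are.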

Finally, the StudentsLife dataset includes responses of the participants to
the Big Five Personality test. The Big Five Personality traits describe
human personality using five dimensions: openness, conscientiousness,
extroversion, agreeableness, and neuroticism. The personality traits of
participants can be used to describe some baseline characteristics of the units and,
for this reason, we include them in the study.


4. Causality Analysis 

We apply the causal inference framework that was previously described 
in order to assess the causal impact of factors like exercising, socializing, 
working or spending time at home on stress level. 

4.1. Variables

Initially, we define the variables that will be included in the study as
follows:

1. $H^{u,d}_t$: denotes the total time in seconds that the user $u$ spent at home
during day $d$ until time $t$;

2. $U^{u,d}_t$: denotes the total time in seconds that the user $u$ spent at
university during day $d$ until time $t$;

3. $O^{u,d}_t$: denotes the total time in seconds that the user $u$ spent in
any place apart from his/her home or university during day $d$ until time
$t$;

4. $E^{u,d}_t$: denotes the total time in seconds that the user $u$ spent exercising
during day $d$ before time $t$ (it is estimated using both location traces
and accelerometer data);

5. $SC^{u,d}_t$: denotes the total time in seconds that the user $u$ spent at any
socialization or entertainment venue during day $d$ before time $t$;

6. $S^{u,d}_t$: denotes the stress level of user $u$ that was reported on day $d$ and
time $t$. Stress level is reported one or more times per day. Thus, in
contrast with the above-mentioned variables, $S^{u,d}_t$ is not continuously
measured;

7. $S^{u,d}_{prev}$: denotes the last stress level that was reported by user $u$ on day
$d - 1$. This variable remains constant within a day;

8. $D^{u,d}$: represents the upcoming deadlines as described in Equation (2);

9. $E^u$, $N^u$, $A^u$, $C^u$, $O^u$: these five variables denote the extroversion,
neuroticism, agreeableness, conscientiousness and openness of user $u$,
respectively, based on his/her Big Five Personality Traits score.

4.2. Units

In this study, we examine the effects of five treatments, denoted by the
variables $H^{u,d}_t$, $U^{u,d}_t$, $O^{u,d}_t$, $E^{u,d}_t$ and $SC^{u,d}_t$, on the stress level of participants,
which is described by the variable $S^{u,d}_t$. A unit of the study corresponds
to a set of attributes derived from the variables of the experiment. All the
variables are sampled every 4 hours, thus there are at most six samples
per day for each participant. Let $T = \{4, 8, 12, 16, 20, 24\}$ (hours of the day) be
the set of sampling times and $t_i$ the $i$-th element of $T$. Then a unit corresponds
to the values of the variables defined above for a user $u$ on day $d$ at sampling
time $t_i$.

Since the variable $S^{u,d}_t$ is not continuously measured, it is not feasible
to sample it at time $t_i$. Instead, we define as $S^{u,d}_{t_i}$ the average stress level of
user $u$ on day $d$ between time $t_i$ and $t_{i+1}$. Thus, $S^{u,d}_{t_i}$ is estimated as follows:

$$S^{u,d}_{t_i} = \mathrm{E}\{S^{u,d}_t\}, \quad \text{for } t_i < t < t_{i+1} \qquad (3)$$

If there are no stress level reports during this time interval, then the unit
that corresponds to this user, day and sampling time will be discarded.
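
The averaging of stress reports over 4-hour intervals can be sketched as follows. The input layout (hour of day as a float, stress level as a number), the handling of reports that fall exactly on an interval boundary, and the function name are assumptions made for this example.

```python
from collections import defaultdict

WINDOW_HOURS = 4  # sampling period stated in the text


def average_stress_per_window(reports):
    """Average the stress reports of one user and one day per 4-hour interval.

    reports: list of (hour_of_day, stress_level) with 0 <= hour_of_day < 24.
    Returns {interval_start_hour: mean stress}. Intervals without any report
    are absent from the result, mirroring the discarded units.
    A report exactly on a boundary is assigned to the interval it starts
    (an assumption; the text uses strict inequalities).
    """
    buckets = defaultdict(list)
    for hour, level in reports:
        start = int(hour // WINDOW_HOURS) * WINDOW_HOURS
        buckets[start].append(level)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


windows = average_stress_per_window([(9.5, 2), (11.0, 4), (13.25, 3)])
```

In this example the two morning reports are averaged into the interval starting at hour 8, the afternoon report stands alone in the interval starting at hour 12, and the four remaining intervals yield no unit.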


4.3. Detection of Confounding Variables

In order to conduct a reliable causation study based on observational data
we need to define the confounding variables. While there is a large number
of factors that could influence the stress level of participants, the study could
be biased only by factors that have a direct influence on both the stress level
and the variable that is considered as treatment in the study. Thus, in our
case we need to specify factors that could influence both the daily activities
of participants and their stress level. For example, the workload of students
can influence their activities (e.g., in periods with increased workload some
students may choose to change their workout schedule, etc.) and their stress
level. Since the workload cannot be directly measured using only sensor
data from smartphones, we use as confounding variables other variables that
provide implicit indicators of workload such as the time that students spend
at home and university and their deadlines. Moreover, participants' choice to
do an activity may exclude another activity from their schedule and it may
also influence their stress level. For example, someone may choose to spend
some time in a pub instead of following his/her normal workout schedule.
The previous day's stress level may also influence both the next day's activities
and stress level. Finally, several studies have demonstrated that stress level
fluctuations are affected by personality traits. In general, more positive
and extrovert people tend to be able to handle stress better than people with
a high neuroticism score. Moreover, personality characteristics may correlate
with the daily schedule that people follow. For example, more extrovert people
may spend less time at home and more time in social activities. In order to
define the covariates of the study we conduct a correlation analysis on the
variables of interest. Since the relationship among the variables may not be
linear, we apply the Kendall rank correlation. The p-values of the Kendall
correlation are presented in Table 1.

0.078    0.01     0.47     0.352         0.214
0.604    0.006    0.005    2.1 · 10^-5   4.7 · 10^-4   0.95

Table 1: P-values of the Kendall correlation under the null hypothesis that the
examined variables are independent.


Based on these results, the time that students spend at home does not
correlate with their stress level. Thus, the variable $H^{u,d}_t$ will not be
included in the causality study. The causal impact of each treatment variable
$U^{u,d}_t$, $O^{u,d}_t$, $E^{u,d}_t$ and $SC^{u,d}_t$ on the effect variable $S^{u,d}_t$ will be examined using
as confounding variables all the variables that correlate with both the
treatment and the effect based on Table 1. We consider a correlation to be significant
enough if the p-value is smaller than 0.1. In Table 2 we present the
confounding variables that will be used for each examined treatment. While the
variables $O^{u,d}_t$ and $SC^{u,d}_t$ are strongly correlated, we do not include $SC^{u,d}_t$
in the set of confounding variables when the treatment is the variable $O^{u,d}_t$,
since our goal is to study the impact of spending time in any place (including
socialization venues) apart from home and working environment.
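
The selection rule described above (keep a candidate variable as a confounder only if it correlates with both the treatment and the effect at p < 0.1) can be sketched as follows. To stay self-contained, this example computes Kendall's tau-a with a normal approximation and no tie correction, which is an assumption of the sketch; in practice a statistics library routine such as SciPy's `scipy.stats.kendalltau` would normally be preferred. All names are illustrative.

```python
from itertools import combinations
from math import erfc, sqrt


def kendall_p_value(x, y):
    """Kendall tau-a and a two-sided p-value from the normal approximation.

    No tie correction is applied, so the p-value is only approximate when
    the data contain many ties or very few observations.
    """
    n = len(x)
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    tau = s / (n * (n - 1) / 2)
    z = tau / sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    return tau, erfc(abs(z) / sqrt(2))


def select_confounders(treatment, outcome, candidates, threshold=0.1):
    """Keep candidates that correlate with both treatment and outcome."""
    selected = []
    for name, values in candidates.items():
        _, p_treatment = kendall_p_value(values, treatment)
        _, p_outcome = kendall_p_value(values, outcome)
        if p_treatment < threshold and p_outcome < threshold:
            selected.append(name)
    return selected


treatment = list(range(10))
outcome = [2 * v for v in range(10)]
candidates = {
    "deadlines": list(range(10)),
    "noise": [0, 9, 1, 8, 2, 7, 3, 6, 4, 5],
}
chosen = select_confounders(treatment, outcome, candidates)
```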

4.4. Creation of Treated and Control Groups

Table 2: Confounding variables for the different applied treatments.

After defining the confounding variables of the study, we need to split
the units into control and treatment groups. We consider binary treatments
by applying thresholds to the examined treatment variables. Thus, for each
of the four examined treatments (i.e., $U^{u,d}_t$, $O^{u,d}_t$, $E^{u,d}_t$, $SC^{u,d}_t$) the units are
split as follows:

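1. $U^{u,d}_{t_i}$: treatment units are all the units with $U^{u,d}_{t_i} < \mathrm{E}\{U^{u,d}_{t_i}\} - \alpha \cdot \mathrm{E}\{U^{u,d}_{t_i}\}$
and control units are all the units with $U^{u,d}_{t_i} > \mathrm{E}\{U^{u,d}_{t_i}\} + \alpha \cdot \mathrm{E}\{U^{u,d}_{t_i}\}$,
for a constant $\alpha \in [0, 1)$. Thus, we consider the treatment value to be
positive when the university sojourn time is relatively small.

2. $O^{u,d}_{t_i}$: treatment units are all the units with $O^{u,d}_{t_i} > \mathrm{E}\{O^{u,d}_{t_i}\} + \alpha \cdot \mathrm{E}\{O^{u,d}_{t_i}\}$
and control units are all the units with $O^{u,d}_{t_i} < \mathrm{E}\{O^{u,d}_{t_i}\} - \alpha \cdot \mathrm{E}\{O^{u,d}_{t_i}\}$.
Thus, we consider the treatment value to be positive when the time
spent in any non-work-related place outside home is relatively large.

3. $E^{u,d}_{t_i}$: treatment units are all the units with $E^{u,d}_{t_i} > 0$, i.e., all the units
that denote that a user $u$ had some exercise on day $d$ before time $t_i$. The
control group contains the units with $E^{u,d}_{t_i} = 0$.

4. $SC^{u,d}_{t_i}$: similarly to the treatment variable $E^{u,d}_{t_i}$, treatment units are
units with $SC^{u,d}_{t_i} > 0$ and control units are units with $SC^{u,d}_{t_i} = 0$.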

Thus, when the treatment variables $U^{u,d}_{t_i}$ and $O^{u,d}_{t_i}$ are considered, units
are classified as treated or untreated based on the time they have spent at
university or at any place apart from their home and university, respectively.
However, in order to examine the impact of exercising and visiting
socialization venues, the binary treatments are defined by considering only whether
there was some exercising activity or a visit to a socialization place or not.
We do not study the impact of these factors by considering also the duration
of these events since the amount of data is not sufficiently large.
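
The threshold-based split into treated and control units can be sketched as follows. Units whose value lies within the band around the mean are discarded. The function name, the `treat_when_low` flag and the use of the sample mean as the expectation are assumptions of this example.

```python
def split_by_threshold(values, alpha, treat_when_low):
    """Split unit indices into (treated, control) around the mean.

    values: treatment variable (e.g. sojourn time in seconds) per unit.
    alpha: relative margin in [0, 1); units with a value between
           mean * (1 - alpha) and mean * (1 + alpha) are dropped.
    treat_when_low: True if a small value counts as the treatment (as for
                    university time), False if a large value does (as for
                    time outside home and work).
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must lie in [0, 1)")
    mean = sum(values) / len(values)
    low, high = mean * (1 - alpha), mean * (1 + alpha)
    below = [i for i, v in enumerate(values) if v < low]
    above = [i for i, v in enumerate(values) if v > high]
    return (below, above) if treat_when_low else (above, below)


university_seconds = [100, 240, 260, 400]  # mean 250, band [212.5, 287.5]
treated, control = split_by_threshold(university_seconds, 0.15, treat_when_low=True)
```

Larger margins sharpen the contrast between the two groups but discard more units, which limits how far the margin can be raised.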

Each of the examined treatment variables describes some user behavior
or activity from the start of the day to some time $t_i$. Consequently, the
comparison of two units with different sampling times $t_i$ is not valid. Thus,
we create a group of pairs of treated and control units $G_{t_i}$ for each one of
the 6 sampling times $t_i$, such that each treated unit $(u, d)$ is matched with a
control unit $(v, d')$ with similar values on its confounding variables. Then,
the average treatment effect is estimated as follows:

$$ATE = \frac{\sum_{t_i \in T} \sum_{((u,d),(v,d')) \in G_{t_i}} \left( S^{u,d}_{t_i} - S^{v,d'}_{t_i} \right)}{\sum_{t_i \in T} N_{G_{t_i}}} \qquad (4)$$

where $N_{G_{t_i}}$ denotes the number of pairs in $G_{t_i}$.

If there is no causal effect of the examined treatment on the stress level
then the average treatment effect should be zero. We use a t-test in
order to decide whether the observed average treatment effect is statistically
significant.
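
A minimal sketch of pooling matched pairs over the sampling times and testing whether the mean difference departs from zero is given below. The text does not state which variant of the t-test is used; this example assumes a one-sample t statistic on the paired differences and returns only the statistic, leaving the p-value to a statistics library. The data layout and names are assumptions.

```python
from math import sqrt


def pooled_ate_and_t(pairs_by_time):
    """Pooled ATE and t statistic of the paired stress differences.

    pairs_by_time: {sampling_time: [(stress_treated, stress_control), ...]}
    Units are only paired within the same sampling time; the differences
    are then pooled across all sampling times.
    """
    diffs = [t - c for pairs in pairs_by_time.values() for t, c in pairs]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two matched pairs are required")
    ate = sum(diffs) / n
    variance = sum((d - ate) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return ate, ate / sqrt(variance / n)


pairs_by_time = {
    8: [(2.0, 3.0), (1.0, 3.0)],
    12: [(2.0, 2.0), (1.0, 2.0)],
}
ate, t_stat = pooled_ate_and_t(pairs_by_time)
```

The statistic would be compared against a t distribution with n - 1 degrees of freedom to decide significance.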


4.5. Balance Check

In order to create balanced treated and control pairs of units we apply the
Genetic Matching method [30]. Genetic Matching is a multivariate
matching method that applies an evolutionary search algorithm to estimate
weights for each confounding variable in order to achieve an optimal covariate
balance. In order to assess whether the treated and control pairs are sufficiently
balanced, we check the standardized mean difference for each confounding
variable of the study. We indicate with $C$ the set of confounding variables.
For each confounding variable $c \in C$, the standardized mean difference is
estimated as follows:


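$$SMD_c = \frac{\sum_{t_i \in T} \sum_{((u,d),(v,d')) \in G_{t_i}} \left( c^{u,d}_{t_i} - c^{v,d'}_{t_i} \right)}{\sqrt{\sigma^2_c} \cdot \sum_{t_i \in T} N_{G_{t_i}}} \qquad (5)$$

where $\sigma^2_c$ denotes the variance of the confounding variable $c$ for the
treated units. The remaining bias from a confounding variable $c$ is considered
to be insignificant if the magnitude of $SMD_c$ is smaller than 0.1 [28].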
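
The balance check for a single confounding variable can be sketched as the difference between the treated and control means divided by the standard deviation of the treated units. The use of the sample variance (division by n - 1) and the function name are assumptions of this example.

```python
from math import sqrt


def standardized_mean_difference(treated_values, control_values):
    """Signed standardized mean difference of one confounder.

    The difference of the group means is scaled by the standard deviation
    of the treated units (sample variance, an assumption of this sketch).
    """
    n_t = len(treated_values)
    if n_t < 2:
        raise ValueError("at least two treated units are required")
    mean_t = sum(treated_values) / n_t
    mean_c = sum(control_values) / len(control_values)
    var_t = sum((x - mean_t) ** 2 for x in treated_values) / (n_t - 1)
    if var_t == 0:
        raise ValueError("the confounder is constant among treated units")
    return (mean_t - mean_c) / sqrt(var_t)


smd = standardized_mean_difference([1.0, 2.0, 3.0], [1.5, 2.5, 2.3])
balanced = abs(smd) < 0.1 + 1e-9  # 0.1 is the balance threshold from the text
```

Running this for every confounding variable after matching shows which variables, if any, still require the matching to be revised.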


5. Results 

According to Table 1, extroversion and neuroticism are the two
personality characteristics that strongly correlate with stress level. Thus, we
investigate whether some of the examined treatments have a different causal
impact on people with high extroversion or neuroticism scores. We conduct
our study collectively for the whole population and selectively for people with
a high extroversion score and people with a high neuroticism score. We define
as Extroverts all the participants with an extroversion score larger than the
average extroversion score of all the participants. Correspondingly, we define a
subpopulation of Neurotics composed of participants with a neuroticism score
higher than the average score of the population.

[Figure 3 panels: (a) Causal effect of variable $U^{u,d}_t$ on the stress level for the
four examined thresholds. (b) Causal effect of variable $O^{u,d}_t$ on the stress level
for the four examined thresholds. (c) Causal effect of variables $E^{u,d}_t$ and $SC^{u,d}_t$
on the stress level (panel title: "Causal impact of exercise and socialization on
stress"; categories: Exercise, Socialization).]

Figure 3: Percentage improvement on the stress level of treated units
compared to control units when each one of the examined treatments is applied.
Percentage improvement is estimated as the ATE normalized by the average
stress level of the control units, multiplied by 100.

In Figure 3 we present the average treatment effect (ATE) normalized
by the average stress level of the control units along with the 95%
confidence intervals for each one of the four examined treatment variables. For
the treatment variables $U^{u,d}_t$ and $O^{u,d}_t$ we present results for $\alpha$ equal to 0,
0.05, 0.1 and 0.15. We do not present results for larger $\alpha$ values since the
number of samples that are discarded is large and the remaining data are not
sufficient for statistically significant conclusions. In Figure 4 and Table 3 we
present the standardized difference, as described in Equation (5), for all the
confounding variables that were used in each one of the causation studies.
According to our results, the magnitude of the standardized difference between
treated and control samples is smaller than 0.1 for all the confounding variables;
thus any confounding bias has been sufficiently minimized.

Our results indicate that the time that students spend at university has
only a weak causal impact on the stress level when participants' samples are
split into treatment and control groups using an $\alpha$ value equal to 0.15. In
detail, participants report 3.1% (with confidence interval ±0.7%) lower stress
level on the days that their sojourn time at university is 15% lower than the
average university sojourn time of the whole population compared to days
that the university sojourn time is 15% larger than usual. However, when
the analysis is limited to people with a high extroversion score, there is no
statistically significant evidence that the time that students spend at university
has any causal effect on stress. When smaller $\alpha$ values are considered, the
causality score is close to zero for the examined set of students as well as for
the Extroverts and Neurotics sub-populations.

Based on our results, the time that students spend in any place apart 
from their home and university has a strong causal impact on their stress 
level. As depicted in Fig. ib, students reported around 3% (with confidence 
interval ±0.65%) lower stress level on days when they spend more time outside 
than average, compared to days when they spend less time outside (i.e., 
a = 0), when the whole set of participants is considered. Similar results are 
observed when the Extroverts and Neurotics sub-populations are examined (the 
observed difference is not statistically significant given the 95% confidence 
intervals of the study). When the value of a is increased, the causal impact 
of the examined variable is stronger. For a = 0.15, the improvement in the 
stress level of students who spend more time outside is 14.45% (with 
confidence interval ±1.5%) when the total population is considered. The 
results are similar when the study is limited to students with a high 
extroversion score. However, the examined variable has a significantly lower 
impact on stress level when the sub-population of Neurotics is considered. 
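
The role of the threshold a in the two studies above can be illustrated with a short sketch. It assumes, based only on the description in the text, that a sample falls in one group when the examined variable exceeds the population average by more than a fraction a, and in the other group when it falls below the average by more than a; the function name and the data layout are hypothetical.

```python
from statistics import mean


def split_by_threshold(daily_values, a=0.15):
    """Split (unit_id, value) samples into those above and below the average.

    Samples within a fraction `a` of the population average belong to
    neither group. Assumes a positive average value.
    """
    avg = mean(value for _, value in daily_values)
    above = [(unit, value) for unit, value in daily_values if value > (1 + a) * avg]
    below = [(unit, value) for unit, value in daily_values if value < (1 - a) * avg]
    return above, below
```

With a = 0 every sample that differs from the average is assigned to a group, while larger a values keep only the more extreme days. Which of the two groups acts as the treatment group depends on the study.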



Figure 4: Standardized difference between treated and control samples for 
each confounding variable when the applied treatment is (a) the variable 
$U^{u,d}$ and (b) the variable $O^{u,d}$. The standardized difference for all 
the confounding variables is less than 0.1; thus, the groups are balanced. 


In Fig. [^c, we examine the impact of exercising and of visiting socialization 
venues on stress level. While the variable $SC^{u,d}$ is strongly correlated 
with the stress level, according to our results, there is no causal link 
between them. This indicates that, while people benefit from spending time 
outside their home or working environment in general, there is no 
statistically significant benefit from visiting specific venues. Finally, 
exercising has a positive effect on the stress level of the examined 
population. However, the results differ when the Extroverts and Neurotics 
sub-populations are examined separately: exercising has a stronger positive 
effect on the stress level of participants with a high neuroticism score, 
while there is no statistically significant benefit for people with a high 
extroversion score. 



pgu,d    j^u^d    UU,d    OU,d 
iV“ 
-0.0035    0.0442    0.0046    -0.0148    -0.0069    -0.0065    0.0001 
j^u,d    -    0087    -0.0011    -    0.0047    0    0.0043 

Table 3: Standardized difference between treated and control samples for 
each one of the confounding variables when the applied treatments correspond 
to the variables $SC^{u,d}$ and $E^{u,d}$. 
6. Discussion 

In this work, we presented a framework for detecting causal links in human 
behavior using mobile phone sensor data. We have studied the causal effects 
of several factors, such as working, exercising and socializing, on the stress 
level of 48 students using data captured by smartphone sensors. Our results 
suggest that exercising and spending time outside home or university have 
a strongly positive causal effect on participants' stress level. We have also 
demonstrated that the time participants stay at university has a positive 
causal impact on their stress level only when it is considerably lower than 
the average daily university sojourn time. However, this impact is small. 

Moreover, we have observed that some of the examined factors have a 
different impact on the stress level of students with a high extroversion 
score than on students with a high neuroticism score. More specifically, more 
extroverted students benefit more from spending time outside home or 
university, while more neurotic students benefit more from exercising. 

Our study mainly relies on raw sensor data that can be easily captured 
with smartphones. We have demonstrated that information extracted by 
simply monitoring users' location and activity (through the accelerometer) 
can serve as an implicit indicator of several factors of interest, such as 
their working and exercising schedule as well as their daily social 
interactions. Inferring this high-level information from raw sensor data 
instead of pop-up questionnaires has three main advantages: 1) it offers a 
more accurate representation of participants' activities over time, since data 
are collected continuously; 2) data are collected in an unobtrusive way 
without requiring participants to provide any feedback, which minimizes the 
risk that some users will quit the study because they are dissatisfied with 
the amount of feedback that they need to provide; 3) data gathered through 
pop-up questionnaires may not be objective, since participants may provide, 
either intentionally or unintentionally, false responses. On the other hand, 
inferences based on sensor data could also be inaccurate, either due to noisy 
sensor measurements or due to the fact that the variable of interest is 
inferred from the sensed data rather than directly measured. For example, in 
our case we assume that a visit to a sports center implies that the user had 
some exercise. However, the user may have visited this place to attend a 
sports event or just to meet friends. Assessing the degree of uncertainty 
that inference from sensor measurements involves and incorporating this 
uncertainty into the causation study represents an interesting research area 
for further investigation. 

Finally, this study involves a limited number of participants who do not 
constitute a representative sample of the population; therefore, general 
conclusions about the causal impact of the examined factors on stress level 
cannot be extrapolated from it. However, the purpose of this article is to 
demonstrate the potential of utilizing smartphones to conduct large-scale 
studies related to human behavior, rather than to present a thorough 
investigation of the factors that influence stress. 

Appendix A. Location Clustering and Labeling 

We create location clusters using raw GPS traces. In order to increase the 
accuracy of location estimation, we consider only GPS samples with a reported 
accuracy of less than 50 meters. Moreover, we ignore any samples that were 
collected while the user was moving. For each new GPS point, we create a new 
cluster only if the distance of this point from the centroid of every 
existing cluster is more than 50 meters. Otherwise, we add the new GPS sample 
to the corresponding cluster. Every time a new GPS sample is added to a 
cluster, the centroid of the cluster is also updated. The pseudocode of the 
location clustering algorithm is presented in Algorithm 1. 

Each location cluster is labeled as home, work/university, gym/sports-center, 
socialization venue or other. The label socialization venue is used to 
describe places like pubs, bars, restaurants and cafeterias. The label other 
is used to describe any place that does not belong to the above-mentioned 
categories. We label as home the place where people spend most of the night 
and early morning hours. In order to find clusters that correspond to 
gyms/sports-centers or socialization venues, we use the Google Maps 
JavaScript API [21]. The Google Maps JavaScript API enables developers to 
search for specific types of places that are close to a GPS point. The type 
of place is specified using keywords from a list provided by this API. We use 
the centroid of each unlabeled cluster to search for nearby places of 
interest. Places that correspond to gyms/sports-centers are specified by the 
keyword gym, and places that correspond to socialization venues are specified 
by the keywords bar, cafe, movie_theater, night_club and restaurant. If a 
point of interest at a distance of less than 50 meters from the cluster 
centroid is found, we label the cluster as gym/sports-center or socialization 
venue, depending on the point of interest type. Otherwise, the cluster is 
labeled as other. Any place within the university campus that is not labeled 
as gym/sports-center or socialization venue is labeled as work/university. 

Data: Set of location points L = {l_1, l_2, ..., l_n} 
Result: Set of clusters C = {c_1, c_2, ..., c_m} 

for each l ∈ L do 
    if accuracy(l) > 50 then 
        continue; 
    end 
    locationClusteredFlag := 0; 
    for each c ∈ C do 
        H := E F}; 
        if distance(l, centroid(c)) < 50 then 
            c := c ∪ {l}; 
            locationClusteredFlag := 1; 
            break; 
        end 
    end 
    if locationClusteredFlag = 0 then 
        newCluster := {l}; 
        C := C ∪ {newCluster}; 
    end 
end 

Algorithm 1: Location clustering 
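
For readers who prefer executable code, the clustering procedure of Algorithm 1 could be sketched in Python as below. This is an illustrative sketch under several assumptions that are not stated in the text: each sample is a tuple (latitude, longitude, accuracy in meters, moving flag), distances are computed with the haversine formula, and the centroid is the arithmetic mean of the coordinates, which is a reasonable approximation only for clusters a few tens of meters wide.

```python
import math
from dataclasses import dataclass, field

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


@dataclass
class Cluster:
    points: list = field(default_factory=list)
    lat: float = 0.0  # centroid latitude
    lon: float = 0.0  # centroid longitude

    def add(self, lat, lon):
        """Add a point and update the centroid as a running mean."""
        self.points.append((lat, lon))
        n = len(self.points)
        self.lat += (lat - self.lat) / n
        self.lon += (lon - self.lon) / n


def cluster_locations(samples, max_accuracy_m=50.0, radius_m=50.0):
    """Cluster (lat, lon, accuracy_m, is_moving) samples as in Algorithm 1."""
    clusters = []
    for lat, lon, accuracy_m, is_moving in samples:
        # Skip imprecise samples and samples collected while moving.
        if accuracy_m > max_accuracy_m or is_moving:
            continue
        for cluster in clusters:
            if haversine_m(lat, lon, cluster.lat, cluster.lon) < radius_m:
                cluster.add(lat, lon)
                break
        else:
            # No existing centroid is close enough: start a new cluster.
            new_cluster = Cluster()
            new_cluster.add(lat, lon)
            clusters.append(new_cluster)
    return clusters
```

As in Algorithm 1, a sample joins the first cluster whose centroid lies within the radius, so the result depends on the order in which the samples are processed.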

Appendix B. Matching Method 

For matching the treatment and control units, we use the MatchIt R package 
[32], which includes an implementation of the genetic matching algorithm 
described above. Several optimization criteria can be used with Genetic 
Matching [33]. Here, the balance metric that the Genetic Matching algorithm 
optimizes is the mean standardized difference of all the confounding 
variables. We use matching with replacement, i.e., each control unit can be 
matched to more than one treatment unit. Matching with replacement can 
reduce the bias, since control units that are very similar to treatment units 
can be used more than once. We use a matching ratio equal to 2, which means 
that each treatment unit will be matched with up to 2 control units. 
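
The effect of matching with replacement and of a matching ratio of 2 can be illustrated with the simplified sketch below. It is not genetic matching and does not reproduce the MatchIt implementation: it substitutes plain nearest-neighbor matching on an unweighted Euclidean distance between confounder vectors, whereas genetic matching searches for distance weights that optimize covariate balance. The function name and the data layout are hypothetical.

```python
import math


def match_with_replacement(treated, control, ratio=2):
    """Match each treated unit to its `ratio` nearest control units.

    `treated` and `control` are lists of confounder vectors of equal length.
    Control units are never removed from the pool, so the same control unit
    may be matched to several treated units.
    Returns a dict: treated index -> list of control indices.
    """
    matches = {}
    for i, t in enumerate(treated):
        ranked = sorted(range(len(control)), key=lambda j: math.dist(t, control[j]))
        matches[i] = ranked[:ratio]  # fewer than `ratio` if the pool is small
    return matches


treated_units = [[0.0, 0.0], [1.0, 1.0]]
control_units = [[0.1, 0.0], [0.9, 1.0], [5.0, 5.0], [0.5, 0.5]]
matched = match_with_replacement(treated_units, control_units)
```

In this toy example the control unit at index 3 is matched to both treated units, which is exactly what matching with replacement permits.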